Author: Anil Gurbuz
Date: 27 JUN 2020
I want to start with thanks to winning team of "Favorita Groceries Sales Competition" for sharing their interesting solution which achieved highest predictive performance only using the last 3 months of the data with a very small training time and let me learn a lot from. This notebook is a adaptation of that approach to M5 forecasting competition.
Instead of representing sales of each day as a row in the training data as in most of the public kernels, each series are represented as a row --similar to the given format by the competition organisers--. This is particularly advantagous for code efficiency as we don't need to use pandas "groupby" which is horribly slow especially when the number of groups are high as in this data -- 30490 groups --.
I will be creating 200+ features for the purpose of this first version of the notebook. They are mostly running statistics on sales, calendar, intermittent demand and price related features -- some are adjusted versions of features for this approach from other public kernels --.
All feature creations, and training of 28 models finishes around 20mins.
I have created my pipeline with regualar approach and also developed this one to discover how this interesting modelling technique works. I can say that in terms of the final WRMSSE score, almost always the regular approach performed better even though it was based on only some basic features relative to this approach. For the best --that I managed to achieve-- of each approach were not so different. My best 28days model with regular approach gave LB 0.61XX whereas with this approach, I managed to get LB 0.63XX none of them use recursive features or magic multipliers. I used quite a lot of complex features that I prefered to keep it to myself until the end of the competition though I can say that they are mainly targeting to capture SNAP-sales interactions.
In terms of the performance, this version of the notebook is intended to be a starter code for those of you that are willing to explore this approach. I can easily say that it taught me a lot about how wide range of ways are possible to model a problem. I will try to explain how this model works in more detail in the upcoming steps.
For each series we look back from some certain points in time -- with one week gaps in between-- and derive features to describe the behaviour of the series up until that point in time.
Above image is showing the points in time that the features are generated. Gaps between are one week which aims to capture weekly cycle in the data. Arrows are looking back to represent that the features at that point are created by looking back in time.
Once we have our features generated, we create 28 models which learns the relationship between these features of a series and the next 1,2,3....28 days sales. Each of these models are designed to learn the relationship between the same derived features and different forecasting horizons so for each training example we have 28 different labels.
Ability to quickly try ideas is critical in here because most of people are struggling to create a trust worthy CV strategy --including me-- so we can search for correlation of CV and LB much more quickly. Also, if you have an existing CV that requires hours of time and you got sick of it, you may prefer to pivot your approach to this one.
I calculated the WRMSSE weights and constants in my local computer and loaded it to the kernel using a pickle file so the additional .pkl files are just these weights.
Even though code works quite fast, I created some check point like save_obj(), load_obj() functions and commented them out for those of you that want to avoid each time running the pre-processing and feature engineering parts of the code.
I mainly created 4 datasets in pre-processing to derive the features from. snap_df is not used in this notebook but It is useful to develop further features.
import wrmsse_utility as w
from datetime import date, timedelta
import gc
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import pickle
from tqdm import tqdm
import lightgbm as lgb
import random
import warnings
warnings.filterwarnings('ignore')
def save_obj(obj, name):
with open( name + '.pkl', 'wb') as f:
pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
# Define functions for loading pickle files during upcoming steps.
def load_weights2(name):
with open('../input/weights2/' + name + '.pkl', 'rb') as f:
return pickle.load(f)
def load_avoid_spike(name):
with open('../input/avoid-spike/' + name + '.pkl', 'rb') as f:
return pickle.load(f)
def load_from_scratch(name):
with open('../input/from-scratch/' + name + '.pkl', 'rb') as f:
return pickle.load(f)
def load_clusters(name):
with open('../input/clusters/' + name + '.pkl', 'rb') as f:
return pickle.load(f)
train_df = pd.read_csv("../input/m5-forecasting-accuracy/sales_train_evaluation.csv")
calendar = pd.read_csv("../input/m5-forecasting-accuracy/calendar.csv")
price = pd.read_csv("../input/m5-forecasting-accuracy/sell_prices.csv")
submission = pd.read_csv("../input/m5-forecasting-accuracy/sample_submission.csv")
category_cols = ['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']
no_id_id_columns = ['item_id', 'store_id', 'cat_id', 'dept_id', 'state_id']
id_cols = ["id"] + category_cols
id_df = train_df[id_cols] # id_df not encoded
print("running")
# Label Encoding of categorical variables
mapper = {}
for col in category_cols:
le = LabelEncoder()
mapper[col] = dict(zip(le.fit_transform(train_df[col]), train_df[col]))
train_df[col] = le.fit_transform(train_df[col])
multi_indexes = train_df.set_index(category_cols).index # multi_indexes are encoded
INDEX_ITEM = multi_indexes.get_level_values("item_id")
INDEX_DEPT = multi_indexes.get_level_values("dept_id")
INDEX_STORE = multi_indexes.get_level_values("store_id")
# Create ordered_ids_and_weights -- Weights are pre-calculated
agg_level_to_denominator = load_weights2("agg_to_denominator2")
agg_level_to_weight = load_weights2("agg_to_weight2")
final_multiplier = (agg_level_to_weight[11]) / agg_level_to_denominator[11]
final_multiplier = final_multiplier.reset_index()
final_multiplier["id"] = final_multiplier["item_id"] + "_" + final_multiplier["store_id"] + "_evaluation"
final_multiplier.drop(["item_id", "store_id"], axis=1, inplace=True)
del agg_level_to_weight, agg_level_to_denominator
gc.collect()
ordered_ids_and_weights = pd.merge(id_df, final_multiplier, on=["id"], how="left")
ordered_ids_and_weights = ordered_ids_and_weights[[0]]
ordered_ids_and_weights.index = multi_indexes
ordered_ids_and_weights = ordered_ids_and_weights.reset_index()
ordered_ids_and_weights = ordered_ids_and_weights.rename({0: "weights"}, axis=1)
# Trainset set column names
train_df.set_index(keys=id_cols, inplace=True)
start_date = date(2011, 1, 29)
train_df.columns = pd.date_range(start_date, freq="D", periods=1941)
train_df.reset_index(inplace=True)
# Calendar
calendar["date"] = pd.to_datetime(calendar.date)
# Preprocess price -- represent dates as columns
price_df = pd.merge(price, id_df, how="left", on=["item_id", "store_id"])
tmp = calendar[["wm_yr_wk", "date"]]
tmp = tmp.groupby("wm_yr_wk").agg(list).reset_index()
price_df = pd.merge(price_df, tmp, how="left", on="wm_yr_wk")
price_df = price_df.explode("date")
price_df.drop(["wm_yr_wk"], axis=1, inplace=True)
price_df = price_df.set_index(id_cols + ["date"])
price_df = price_df[["sell_price"]].unstack()
price_df.columns = price_df.columns.droplevel()
price_df.reset_index(inplace=True)
# Preprocess Calendar and SNAP
tmp = calendar[["date", "event_name_1"]]
tmp2 = tmp.dropna(axis=0)
calendar_df = pd.DataFrame(columns=pd.date_range(start_date, freq="D", periods=1913 + 56),
index=tmp2.event_name_1.unique())
tmp3 = tmp2.groupby("event_name_1").agg(list).reset_index()
a = zip(tmp3["event_name_1"], tmp3["date"])
for row, col in a: calendar_df.loc[row, col] = 1
snap_ca = pd.DataFrame(index=["CA"], columns=pd.date_range(start_date, freq="D", periods=1913 + 56))
snap_tx = pd.DataFrame(index=["TX"], columns=pd.date_range(start_date, freq="D", periods=1913 + 56))
snap_wa = pd.DataFrame(index=["WI"], columns=pd.date_range(start_date, freq="D", periods=1913 + 56))
snap_ca.loc["CA", :] = calendar["snap_CA"].values
snap_tx.loc["TX", :] = calendar["snap_TX"].values
snap_wa.loc["WI", :] = calendar["snap_WI"].values
snap_df = pd.concat([snap_ca, snap_tx, snap_wa])
calendar_df = calendar_df.fillna(0)
#Remove Repeating records in Calendar and create seperate series for NBA and Ramadan
calendar_df = calendar_df.loc[calendar_df.index!= "LentWeek2",]
calendar_df.iloc[-1,:] = calendar_df.iloc[6,:].values + calendar_df.iloc[-1,:].values # Easter Fix
calendar_df.iloc[-6,:] = calendar_df.iloc[-3,:].values + calendar_df.iloc[-6,:].values # Christmas Fix
NBA = (calendar_df.iloc[11,:].values + calendar_df.iloc[12,:].values).cumsum() %2
NBA = pd.Series(NBA, index=calendar_df.columns)
RAMADAN = (calendar_df.iloc[15,:].values + calendar_df.iloc[16,:].values).cumsum() %2
RAMADAN = pd.Series(RAMADAN, index=calendar_df.columns)
calendar_df = calendar_df.loc[~calendar_df.index.isin(["OrthodoxEaster","OrthodoxChristmas","NBAFinalsStart",\
"NBAFinalsEnd","Ramadan starts","Eid al-Fitr"]),]
event_mapper = dict(zip(list(range(1,24)),list(calendar_df.index)))
calendar_df = ((np.arange(1,24) * calendar_df.T).T).sum()
mapper_back_state = {v: k for k, v in mapper["state_id"].items()}
snap_df.index = snap_df.index.map(mapper_back_state)
del snap_ca, snap_tx, snap_wa, tmp, tmp2, tmp3, calendar, price, mapper_back_state
print("Dataframes are ready...")
# Select only the part of the data that will be used in training and feature engineering
def sample_from_train(train_df, start, end):
return train_df[ id_cols + list(pd.date_range(start, end))]
# Select the data between minus days before the date_from and date_from
def get_timespan(df, date_from, minus, periods, freq="D"):
return df[pd.date_range(date_from - timedelta(days=minus), periods=periods, freq=freq)]
# Create sales related features
def fe(df, date_from, get_label=True, name_prefix=None):
X = dict()
# Windows size 3,7,30,180 -- Selected by intuition and trail
for i in [3, 7, 30, 180]:
tmp = get_timespan(df, date_from, i, i)
X['diff_%s_mean' % i] = tmp.diff(axis=1).mean(axis=1).values
X['mean_%s_decay' % i] = (tmp * np.power(0.9, np.arange(i)[::-1])).sum(axis=1).values
X['mean_%s' % i] = tmp.mean(axis=1).values
X['median_%s' % i] = tmp.median(axis=1).values
X['min_%s' % i] = tmp.min(axis=1).values
X['max_%s' % i] = tmp.max(axis=1).values
X['std_%s' % i] = tmp.std(axis=1).values
# Windows size 3,7,15,30 -- Selected by intuition and trail
for i in [3, 7, 15, 30]:
tmp = get_timespan(df, date_from + timedelta(days=-7), i, i)
X['diff_%s_mean_2' % i] = tmp.diff(axis=1).mean(axis=1).values
X['mean_%s_decay_2' % i] = (tmp * np.power(0.9, np.arange(i)[::-1])).sum(axis=1).values
X['mean_%s_2' % i] = tmp.mean(axis=1).values
X['median_%s_2' % i] = tmp.median(axis=1).values
X['min_%s_2' % i] = tmp.min(axis=1).values
X['max_%s_2' % i] = tmp.max(axis=1).values
X['std_%s_2' % i] = tmp.std(axis=1).values
# Windows size 3,7,13,30,180 -- Selected by intuition and trail
for i in [3, 7, 14, 30, 180]:
tmp = get_timespan(df, date_from, i, i)
X['has_sales_days_in_last_%s' % i] = (tmp > 0).sum(axis=1).values
X['last_has_sales_day_in_last_%s' % i] = i - ((tmp > 0) * np.arange(i)).max(axis=1).values
X['first_has_sales_day_in_last_%s' % i] = ((tmp > 0) * np.arange(i, 0, -1)).max(axis=1).values
X['Number_of_days_to_max_sales_in_last_%s' % i] = (pd.to_datetime(date_from) - pd.to_datetime(
get_timespan(df, date_from, i, i).idxmax(axis=1).values)).days.values
X['Number_of_days_to_min_sales_in_last_%s' % i] = (pd.to_datetime(date_from) - pd.to_datetime(
get_timespan(df, date_from, i, i).idxmin(axis=1).values)).days.values
# Lag features of last 7 days
for i in range(1, 8):
X["lag_%s" % i] = get_timespan(df, date_from, i, 1).values.ravel()
# Weekday avg. of 4,8,13,26,52 weeks windows
for i in range(7):
X["mean_4_dow_%s" % i] = get_timespan(df, date_from, 4 * 7 - i, 4, freq="7D").mean(axis=1).values
X["mean_8_dow_%s" %i] = get_timespan(df, date_from, 8 * 7 - i, 8, freq="7D").mean(axis=1).values
X["mean_13_dow_%s" % i] = get_timespan(df, date_from, 13 * 7 - i, 13, freq="7D").mean(axis=1).values
X["mean_26_dow_%s" % i] = get_timespan(df, date_from, 26 * 7 - i, 26, freq="7D").mean(axis=1).values
X["mean_52_dow_%s" % i] = get_timespan(df, date_from, 52 * 7 - i, 52, freq="7D").mean(axis=1).values
X = pd.DataFrame(X)
if name_prefix is not None:
X.columns = ['%s_%s' % (name_prefix, c) for c in X.columns]
return X
# LABELS ARE SALES VALUES OF NEXT 28 DAYS
if get_label:
y = df[pd.date_range(date_from, periods=28)].values
return X, y
else:
return X
# Create event related features
def calendar_fe(calendar_df, date_from, number_of_series=30490, add_bulk=False):
X={}
for i in [7, 14, 28]:
tmp = get_timespan(calendar_df, date_from, i, i)
if ((tmp > 0) * np.arange(i)).max() == 0:
X["days_after_last_event_in_last_%s_days" %i] = np.repeat(np.NaN,number_of_series)
else:
X["days_after_last_event_in_last_%s_days" %i] = np.repeat(i - ((tmp > 0) * np.arange(i)).max(), number_of_series)
for i in [7, 14, 28]:
tmp = get_timespan(calendar_df, date_from + timedelta(days=i), i, i)
if ((tmp > 0 ) * np.arange(i, 0, -1)).max() ==0:
X["days_to_NEXT_event_in_%s"%i] = np.repeat(np.NaN,number_of_series)
else:
X["days_to_NEXT_event_in_%s"%i] = np.repeat(i - ((tmp > 0 ) * np.arange(i, 0, -1)).max(), number_of_series)
X = pd.DataFrame(X)
nba_tmp = get_timespan(NBA, date_from + timedelta(days=28), 28, 28)
nba_tmp = np.array([nba_tmp]*30490)
nba_tmp = pd.DataFrame(nba_tmp, columns=["NBA_AT_NEXT_%s_CATEGORICAL"%i for i in range(1,29)])
ramadan_tmp = get_timespan(RAMADAN, date_from + timedelta(days=28), 28, 28)
ramadan_tmp = np.array([ramadan_tmp]*30490)
ramadan_tmp = pd.DataFrame(ramadan_tmp, columns=["RAMADAN_AT_NEXT_%s_CATEGORICAL"%i for i in range(1,29)])
cal_tmp = get_timespan(calendar_df, date_from + timedelta(days=28), 28, 28)
cal_tmp = np.array([cal_tmp]*30490)
cal_tmp = pd.DataFrame(cal_tmp, columns=["EVENT_AT_NEXT_%s_CATEGORICAL"%i for i in range(1,29)])
X = pd.concat([X,nba_tmp,ramadan_tmp,cal_tmp],axis=1)
return X
def price_fe(price_df, date_from):
X={}
for i in [28]:
tmp = get_timespan(price_df, date_from + timedelta(days=i), i, i)
X["max_price_NEXT_%s_days" %i] = tmp.max(axis=1).values
X["min_price_NEXT_%s_days" % i] = tmp.min(axis=1).values
X["days_to_firt_price_drop_in_NEXT_%s_days_CATEGORICAL"] = (((tmp.diff(axis=1)<0)*np.arange(i)).max(axis=1)).replace(0,np.NaN)
X["days_to_firt_price_increase_in_NEXT_%s_days_CATEGORICAL"] = (((tmp.diff(axis=1)>0)*np.arange(i)).max(axis=1)).replace(0,np.NaN)
for i in [7,28,180]:
tmp = get_timespan(price_df, date_from, i, i)
X["days_after_last_price_drop_%s_days"%i] = (i - ((tmp.diff(axis=1)<0)*np.arange(i)).max(axis=1)).replace(i,np.NaN)
X["days_after_last_price_increse_%s_days"%i] = (i - ((tmp.diff(axis=1)>0)*np.arange(i)).max(axis=1)).replace(i,np.NaN)
X["max_price_last_%s_days" % i] = tmp.max(axis=1).values
X["min_price_last_%s_days" % i] = tmp.min(axis=1).values
X["percent_price_change_last_%s_days" % i] = (X["max_price_last_%s_days" % i] - X[
"min_price_last_%s_days" % i]) / X["min_price_last_%s_days" % i]
X["price_NA_last_%s_days" % i] = tmp.isna().sum(axis=1).values
X = pd.DataFrame(X)
return X
def create_train_and_val_as_list_of_df(train_df, item_df, store_dept_df, calendar_df, price_df, multi_indexes, val_from, number_of_weeks):
X_l = []
y_l = []
weights = ordered_ids_and_weights.copy()
for i in tqdm(range(number_of_weeks), leave=False):
dt_from = val_from + timedelta(days=- i*7)
X, y = fe(train_df, dt_from, get_label=True)
X_item = fe(item_df, dt_from, get_label=False, name_prefix="Item")
X_item = X_item.reindex(INDEX_ITEM).reset_index(drop=True)
X_store_dept = fe(store_dept_df, dt_from, get_label=False, name_prefix="Store_Dept")
X_store_dept.index = original_store_dept_index
X_store_dept = X_store_dept.reindex(pd.MultiIndex.from_arrays([INDEX_STORE,INDEX_DEPT])).reset_index(drop=True)
X_calendar = calendar_fe(calendar_df, dt_from)
X_price = price_fe(price_df, dt_from)
weights["weights"] *= (0.997**i)
X_l.append(w.reduce_mem_usage(pd.concat([X, X_item, X_store_dept, X_calendar, X_price, weights], axis=1)))
y_l.append(y)
return X_l, y_l
# Create same features this time to describe the status series just before the test start date
def create_test_as_df(train_df, item_df, store_dept_df, calendar_df, price_df, multi_indexes, test_from):
X_test = fe(train_df, test_from, get_label=False)
X_item = fe(item_df, test_from, get_label=False, name_prefix="Item")
X_item = X_item.reindex(INDEX_ITEM).reset_index(drop=True)
X_store_dept = fe(store_dept_df, test_from, get_label=False, name_prefix="Store_Dept")
X_store_dept.index = original_store_dept_index
X_store_dept = X_store_dept.reindex(pd.MultiIndex.from_arrays([INDEX_STORE,INDEX_DEPT])).reset_index(drop=True)
X_calendar = calendar_fe(calendar_df, test_from)
X_price = price_fe(price_df, test_from)
return pd.concat([X_test, X_item, X_store_dept, X_calendar, X_price, ordered_ids_and_weights], axis=1)
# define cost function -- From Ragnar's
def custom_asymmetric_train(y_pred, y_true):
y_true = y_true.get_label()
residual = (y_true - y_pred).astype("float")
grad = np.where(residual < 0, -2 * residual, -2 * residual * 1.10)
hess = np.where(residual < 0, 2, 2 * 1.10)
return grad, hess
train_start = date(2013,1,1)
train_end = date(2016, 4, 24)
validation_start = date(2016, 4, 25)
validation_end = date(2016, 5, 22)
test_start = date(2016, 5, 23)
test_end = date(2016, 6, 19)
number_of_weeks = 60
#### IF wanted to try CV on last years periods --> validation weeks will be popped with [48, 100] ####
### pd.date_range includes both sides when start and end given ###
### only includes start and adds periods when used with periods and freq ###
# Select fresh examples
train_df = sample_from_train(train_df, train_start, validation_end)
price_df = sample_from_train(price_df, train_start, test_end)
item_df = train_df.groupby(["item_id"]).sum().iloc[:,4:]
store_dept_df = train_df.groupby(["store_id","dept_id"]).sum().iloc[:,3:]
original_store_dept_index = store_dept_df.index
# Create train, labels and test
X_l, y_l = create_train_and_val_as_list_of_df(train_df, item_df, store_dept_df, calendar_df, price_df,\
multi_indexes, validation_start, number_of_weeks)
test_X = create_test_as_df(train_df, item_df, store_dept_df, calendar_df, price_df, multi_indexes, test_start)
val_X = X_l.pop(0)
val_y = y_l.pop(0)
#Make train_X and train labels
train_X = pd.concat(X_l, axis=0)
train_y = np.concatenate(y_l, axis=0)
del X_l,y_l
gc.collect()
excld_features = ["Store_Dept_last_has_sales_day_in_last_3",
"Store_Dept_has_sales_days_in_last_7",
"Store_Dept_last_has_sales_day_in_last_7",
"Store_Dept_first_has_sales_day_in_last_7",
"Store_Dept_last_has_sales_day_in_last_14",
"Store_Dept_first_has_sales_day_in_last_14",
"Store_Dept_last_has_sales_day_in_last_30",
"Store_Dept_first_has_sales_day_in_last_30",
"Store_Dept_last_has_sales_day_in_last_180",
"Store_Dept_first_has_sales_day_in_last_180",
"NBA_AT_NEXT_1_CATEGORICAL",
"NBA_AT_NEXT_2_CATEGORICAL",
"NBA_AT_NEXT_3_CATEGORICAL",
"NBA_AT_NEXT_4_CATEGORICAL",
"NBA_AT_NEXT_5_CATEGORICAL",
"NBA_AT_NEXT_6_CATEGORICAL",
"NBA_AT_NEXT_7_CATEGORICAL",
"NBA_AT_NEXT_8_CATEGORICAL",
"NBA_AT_NEXT_9_CATEGORICAL",
"NBA_AT_NEXT_10_CATEGORICAL",
"NBA_AT_NEXT_11_CATEGORICAL",
"NBA_AT_NEXT_12_CATEGORICAL",
"NBA_AT_NEXT_13_CATEGORICAL",
"NBA_AT_NEXT_14_CATEGORICAL",
"NBA_AT_NEXT_15_CATEGORICAL",
"NBA_AT_NEXT_16_CATEGORICAL",
"NBA_AT_NEXT_17_CATEGORICAL",
"NBA_AT_NEXT_18_CATEGORICAL",
"NBA_AT_NEXT_19_CATEGORICAL",
"NBA_AT_NEXT_20_CATEGORICAL",
"NBA_AT_NEXT_21_CATEGORICAL",
"NBA_AT_NEXT_22_CATEGORICAL",
"NBA_AT_NEXT_23_CATEGORICAL",
"NBA_AT_NEXT_24_CATEGORICAL",
"NBA_AT_NEXT_25_CATEGORICAL",
"NBA_AT_NEXT_26_CATEGORICAL",
"NBA_AT_NEXT_27_CATEGORICAL",
"NBA_AT_NEXT_28_CATEGORICAL",
"RAMADAN_AT_NEXT_1_CATEGORICAL",
"RAMADAN_AT_NEXT_2_CATEGORICAL",
"RAMADAN_AT_NEXT_3_CATEGORICAL",
"RAMADAN_AT_NEXT_4_CATEGORICAL",
"RAMADAN_AT_NEXT_5_CATEGORICAL",
"RAMADAN_AT_NEXT_6_CATEGORICAL",
"RAMADAN_AT_NEXT_7_CATEGORICAL",
"RAMADAN_AT_NEXT_8_CATEGORICAL",
"RAMADAN_AT_NEXT_9_CATEGORICAL",
"RAMADAN_AT_NEXT_10_CATEGORICAL",
"RAMADAN_AT_NEXT_11_CATEGORICAL",
"RAMADAN_AT_NEXT_12_CATEGORICAL",
"RAMADAN_AT_NEXT_13_CATEGORICAL",
"RAMADAN_AT_NEXT_14_CATEGORICAL",
"RAMADAN_AT_NEXT_15_CATEGORICAL",
"RAMADAN_AT_NEXT_16_CATEGORICAL",
"RAMADAN_AT_NEXT_17_CATEGORICAL",
"RAMADAN_AT_NEXT_18_CATEGORICAL",
"RAMADAN_AT_NEXT_19_CATEGORICAL",
"RAMADAN_AT_NEXT_20_CATEGORICAL",
"RAMADAN_AT_NEXT_21_CATEGORICAL",
"RAMADAN_AT_NEXT_22_CATEGORICAL",
"RAMADAN_AT_NEXT_23_CATEGORICAL",
"RAMADAN_AT_NEXT_24_CATEGORICAL",
"RAMADAN_AT_NEXT_25_CATEGORICAL",
"RAMADAN_AT_NEXT_26_CATEGORICAL",
"RAMADAN_AT_NEXT_27_CATEGORICAL",
"RAMADAN_AT_NEXT_28_CATEGORICAL",
"EVENT_AT_NEXT_2_CATEGORICAL",
"EVENT_AT_NEXT_6_CATEGORICAL",
"EVENT_AT_NEXT_9_CATEGORICAL",
"EVENT_AT_NEXT_16_CATEGORICAL",
"EVENT_AT_NEXT_18_CATEGORICAL",
"EVENT_AT_NEXT_21_CATEGORICAL",
"EVENT_AT_NEXT_23_CATEGORICAL",
"EVENT_AT_NEXT_25_CATEGORICAL",
"EVENT_AT_NEXT_26_CATEGORICAL",
"EVENT_AT_NEXT_28_CATEGORICAL",
"days_to_firt_price_drop_in_NEXT_%s_days_CATEGORICAL",
"days_to_firt_price_increase_in_NEXT_%s_days_CATEGORICAL",
"days_after_last_price_increse_7_days",
"price_NA_last_7_days",
"price_NA_last_28_days",
"dept_id"]
train_X = train_X.drop(excld_features, axis=1)
val_X = val_X.drop(excld_features, axis=1)
test_X = test_X.drop(excld_features, axis=1)
features = [col for col in val_X.columns if col != "weights"]
category_features = [col for col in features if "_CATEGORICAL" in col]
submission_name="submission.csv"
print("Number of Features: ", len(features))
gc.collect()
submission = pd.read_csv("../input/m5-forecasting-accuracy/sample_submission.csv")
df_val_pred = submission.iloc[:30490, 1:].copy()
submission = submission.iloc[30490:, 1:].reset_index(drop=True)
# Pre determined clusters. Generated in the given link here:
# https://www.kaggle.com/anlgrbz/clustering-with-intermittent-demand-related-featrs
id0 = load_clusters("id_gm0")
id1 = load_clusters("id_gm1")
id2 = load_clusters("id_gm2")
idx0 = id0.index
idx1 = id1.index
idx2 = id2.index
train_y = pd.DataFrame(train_y,index=train_X.index)
val_y = pd.DataFrame(val_y,index=val_X.index)
# Training configurations. Using early stopping by setting last 28 days as validation set.
# For final model to use, this early stopping can be changed with observed number of iterations
# and add the last 28 days to training set as it includes crucial information.
params = {
'boosting_type': 'gbdt',
'metric': 'rmse',
'objective': 'custom',
#"tweedie_variance_power": 1.1,
'n_jobs': -1,
'seed': 236,
"num_leaves": 63,
'learning_rate': 0.1,
'bagging_fraction': 0.75,
'bagging_freq': 10,
'colsample_bytree': 0.6,
"num_boost_round": 2500,
"early_stopping_rounds": 50,
"min_data_in_leaf": 30}
# Model training loop for 28 different horizons
for i in range(28):
print("=" * 50)
print("Fold%s" % (i + 1))
print("=" * 50)
# For each horizon, there will be 3 models to trian similar products together based on their intermitten demand charachteristics.
for j, idx in enumerate([idx0,idx1,idx2]):
# Create LightGBM datasets for current fold
train_set = lgb.Dataset(train_X.loc[train_X.index.isin(idx),features],
train_y.loc[train_y.index.isin(idx), i],
weight=train_X.loc[train_X.index.isin(idx),"weights"],
categorical_feature=category_features)
val_set = lgb.Dataset(val_X.loc[val_X.index.isin(idx),features],
val_y.loc[val_y.index.isin(idx), i],
weight=val_X.loc[val_X.index.isin(idx),"weights"],
categorical_feature=category_features)
# Train model
model = lgb.train(params, train_set, valid_sets= [train_set,val_set], verbose_eval=50, fobj=custom_asymmetric_train)
# Make validation set predictions
val_preds = model.predict(val_X.loc[val_X.index.isin(idx),features])
df_val_pred.loc[df_val_pred.index.isin(idx), "F%s"%(i+1)] = val_preds
# Make predictions on test data and save into submission file
submission.loc[submission.index.isin(idx),"F%s"%(i+1)] = model.predict(test_X.loc[test_X.index.isin(idx),features])
submission.to_csv(submission_name, index=False)
# Diagnose the model by investigating some of the predictions.
sales_clean = pd.read_csv("../input/m5-forecasting-accuracy/" + 'sales_train_evaluation.csv')
cal = pd.read_csv("../input/m5-forecasting-accuracy/calendar.csv")
prices = pd.read_csv("../input/m5-forecasting-accuracy/sell_prices.csv")
train_df = sales_clean.iloc[:, :-28]
valid_df = sales_clean.iloc[:, -28:] # forecast for the last 28 days of train period
evaluator = w.WRMSSEEvaluator(train_df, valid_df, cal, prices)
df_val_pred.columns = ["d_"+ str(i) for i in range(1914,1942)]
# Calculate validation scores
evaluator.score(df_val_pred)
evaluator.print_more_scores()
# Thanks kaggle comunity for providing visualisation code
w.create_dashboard(evaluator, groups='all', cal=cal)